Make Char.isLower and Char.isUpper Unicode-aware #970

Janiczek · 2018-07-24T21:59:16Z

This allows people with non-ASCII alphabets to work with Char.isLower and Char.isUpper. It uses toUpper and toLower underneath, which in turn use JavaScript's String.prototype.toLowerCase() and String.prototype.toUpperCase().

The second condition in the functions is there to distinguish between characters that have an upper/lower-case pairing, and those that don't ('0' == Char.toLower '0' but we don't want isLower '0' to be true).
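The two-condition check described above can be sketched in JavaScript, the language whose `toUpperCase()` and `toLowerCase()` the Elm functions rely on. This is only an illustration of the idea, not the code from the PR; the function names `isUpper` and `isLower` are borrowed from Elm for readability.

```javascript
// Illustrative sketch only: a JavaScript rendering of the check described above.
// First condition: the character is already in the target case.
// Second condition: the character actually has a counterpart in the other case.
function isUpper(ch) {
  return ch === ch.toUpperCase() && ch !== ch.toLowerCase();
}

function isLower(ch) {
  return ch === ch.toLowerCase() && ch !== ch.toUpperCase();
}

console.log(isUpper("Č")); // true: non-ASCII uppercase letter
console.log(isLower("č")); // true: non-ASCII lowercase letter
console.log(isLower("0")); // false: '0' has no case pairing
```

Without the second condition, `isLower("0")` would be true, because `"0".toLowerCase()` returns `"0"` unchanged.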

EDIT: I can't update the code of this PR anymore; see #1138 for a fix for the == and /= import.


Janiczek · 2018-07-24T22:00:12Z

Is related to #385.

drathier · 2018-07-26T15:16:18Z

What's considered an uppercase character depends on your locale. This PR is still a major improvement.

Related to #942.

evancz · 2018-07-27T20:20:21Z

For future reference, the documentation for toLocaleUpperCase describes cases where this will break:

The toLocaleUpperCase() method returns the value of the string converted to upper case according to any locale-specific case mappings. toLocaleUpperCase() does not affect the value of the string itself. In most cases, this will produce the same result as toUpperCase(), but for some locales, such as Turkish, whose case mappings do not follow the default case mappings in Unicode, there may be a different result.

So it seems that toUpperCase() is a pure function, but toLocaleUpperCase() is not. My instinct is that the "correct" version of this function takes Language as an argument. (Not sure if the idea of "locale" is better. Maybe it is geographical? Maybe there is some standards body that defines locales?)
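To make the distinction concrete, here is a small JavaScript sketch. It assumes a runtime that implements the optional `locales` argument of `toLocaleUpperCase` (from the ECMAScript Internationalization API); the helper names `defaultUpper` and `upperIn` are made up for this example. Passing the language tag explicitly is one way to turn the locale into an argument, as suggested above.

```javascript
// Default Unicode case mapping: does not consult the host's locale.
function defaultUpper(s) {
  return s.toUpperCase();
}

// Locale-aware mapping with the language passed explicitly (a BCP 47 tag).
// Hypothetical helper: the result depends on the arguments rather than on
// the machine's default locale.
function upperIn(locale, s) {
  return s.toLocaleUpperCase(locale);
}

console.log(defaultUpper("i"));     // "I"
console.log(upperIn("en-US", "i")); // "I"
// On runtimes with ICU support this is expected to be "İ" (U+0130),
// the dotted capital I used in Turkish.
console.log(upperIn("tr", "i"));
```

Calling `toLocaleUpperCase()` with no argument falls back to the host's default locale, which is where the dependence on the environment comes from.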

I do not want us to theorize about these things here. The next step is to find nice links that describe:

How "upper case" is defined by unicode. Is there a big table somewhere?
How a "locale" is defined and who manages that. Are there "new locales" if human culture changes? Who captures that, and how do browsers know about it?

I would prefer to understand the problem more completely before changing things.

Janiczek · 2018-07-27T21:36:16Z

From my cursory googling and research:

1. How is "upper case" defined?

I think this FAQ is the link you want.

In short, yes, there is a big table. Three, in fact.

ftp://ftp.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt
ftp://ftp.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt (GitHub or Markdown doesn't support FTP links 🤷‍♂️ )

Here is the relevant section of the standard.

It offers some degree of stability across Unicode versions.
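As a rough idea of what "the big table" looks like, each record in UnicodeData.txt is a line of 15 semicolon-separated fields; the general category is the third field, and the simple uppercase, lowercase, and titlecase mappings are the last three. The sketch below parses one such line. The function name `parseUnicodeDataLine` and the shape of the returned object are assumptions made for this example.

```javascript
// The UnicodeData.txt record for U+0061 LATIN SMALL LETTER A.
const sampleLine = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041";

// Hypothetical helper: extracts only the fields relevant to casing.
function parseUnicodeDataLine(line) {
  const fields = line.split(";");
  // Mapping fields are hex code points, or empty when there is no mapping.
  const toCodePoint = (hex) => (hex === "" ? null : parseInt(hex, 16));
  return {
    codePoint: parseInt(fields[0], 16),
    name: fields[1],
    category: fields[2], // e.g. "Lu", "Ll", "Lt"
    simpleUpper: toCodePoint(fields[12]),
    simpleLower: toCodePoint(fields[13]),
    simpleTitle: toCodePoint(fields[14]),
  };
}

const record = parseUnicodeDataLine(sampleLine);
console.log(record.category);                          // "Ll"
console.log(String.fromCodePoint(record.simpleUpper)); // "A"
console.log(record.simpleLower);                       // null: already lowercase
```

SpecialCasing.txt and CaseFolding.txt use different layouts and are not covered by this sketch.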

2. How is "locale" defined?

Again, a Unicode FAQ; and this time there's a whole homepage.

You can download the current version. It contains a lot of XML files with various data (casing of dates / languages / ..., etc.), to be interpreted according to LDML.

They are also transformed from the XML into JSON, which might be a better fit for Elm?

Janiczek · 2018-07-28T22:28:46Z

We might try to be extra-pure and host the big table, in Elm format, ourselves, but that would make elm/core very big, I imagine. The browser already has that cached in the form of .toLowerCase().

The .toLocaleLowerCase() functions would benefit from the Language argument (to become pure), but I wonder if it's important. In what situation would the function start misbehaving? (Off the top of my head: the computer's location changing? System settings changing?) And is it important, would it affect the user somehow?

I mean, even Date, when toStringed, will show different things on different machines, based on your timezone.

main : Html msg
main =
    "2018-05-10"
        |> Date.fromString
        |> toString
        |> Html.text

shows Ok <Thu May 10 2018 02:00:00 GMT+0200 (Central European Summer Time)> on my machine. (Ellie) It will presumably show something different on yours. It's not pure. Is that problematic?
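The same effect can be seen directly in JavaScript, which the Elm `Date` rendering above ultimately reflects. In this sketch the helper names `localRendering` and `utcRendering` are made up for illustration: the first depends on the host's time zone, the second does not.

```javascript
// Date-only ISO strings are parsed as midnight UTC.
const date = new Date("2018-05-10");

// Depends on the time zone configured on the machine running the code.
function localRendering(d) {
  return d.toString();
}

// Always rendered in UTC, so the result is the same on every machine.
function utcRendering(d) {
  return d.toISOString();
}

console.log(localRendering(date)); // varies by machine
console.log(utcRendering(date));   // "2018-05-10T00:00:00.000Z"
```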

gormonn · 2020-10-28T16:54:35Z

I ran into the same issue while doing the exercise in the forms section, namely checking a password for uppercase characters.
(By the way, it's sad to run into such behavior before even getting well acquainted with the language.)

At first glance, I thought this was a serious omission.

However, then I wondered whether it is worth letting users put non-ASCII Unicode characters in their passwords.
On the one hand, such a password would be more resistant to brute force.
On the other hand, it may be inconvenient for the user if a device does not have the specific locale.

However, in any case, this is not decided at the stage of front-end approval, but much earlier. Thus, this limitation may frustrate developers from non-English-speaking regions, and it negatively affects the use of Elm as the main front-end stack in enterprise environments. That in turn affects the popularity and development of the language.

But I believe that such an annoying flaw will still not be a problem for Elm.

Meanwhile, I'm very curious whether I'm stuck in a mental trap just because Elm doesn't support my native language well. There may be other uses for Unicode and Char.isUpper that we are not aware of, so please write if you know of any.

avh4 · 2020-10-28T16:58:43Z

FYI, this package can currently be used to deal with unicode strings: https://package.elm-lang.org/packages/BrianHicks/elm-string-graphemes/latest/

sagehane · 2022-01-18T04:08:39Z

core/src/Char.elm

Lines 84 to 85 in e47edeb

    
    (char == Char.toUpper char)
        && (char /= Char.toLower char)

Wouldn't char /= Char.toLower char suffice? If a character does not equal its lowercase counterpart, it must be uppercase.

A would become a, True
a would become a, False
0 would become 0, False

Same applies to isLower.

(I tried using code comments, didn't work, idk why)

Edit: The one scenario where this might make a difference is if there's a "middle case" character that has both an upper and a lower case variant. But I don't think such a character exists, and even if it does, should isUpper return a True or False in such a scenario?
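The question can be explored with a small JavaScript sketch that puts the one-condition check next to the two-condition one. Both function names are made up for this comparison, and JavaScript is used because Elm's `toUpper` and `toLower` delegate to its case functions.

```javascript
// One-condition variant from the question above: "differs from its lowercase form".
function isUpperSimple(ch) {
  return ch !== ch.toLowerCase();
}

// Two-condition variant: additionally requires the character to equal its uppercase form.
function isUpperBoth(ch) {
  return ch === ch.toUpperCase() && ch !== ch.toLowerCase();
}

for (const ch of ["A", "a", "0"]) {
  console.log(ch, isUpperSimple(ch), isUpperBoth(ch));
}
// A true true
// a false false
// 0 false false
```

The two variants agree on these inputs. They can only disagree for a character that differs from both its uppercase and its lowercase form, which is exactly the "middle case" scenario raised in the edit.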

This does not fully solve the problem of detecting case in Unicode, as it can also vary by locale. This does make the isUpper/Lower and toUpper/Lower functions consistent.

miniBill · 2022-02-04T15:24:34Z

@sagehane

> let s = Char.fromCode 453 in (s, Char.toUpper s, Char.toLower s)
('ǅ','Ǆ','ǆ') : ( Char, Char, Char )

sagehane · 2022-02-04T16:19:49Z

@miniBill, good to know. So, would you argue that 'ǅ'.isLower() should return true, or false? I feel like if a character has an uppercase form, it should be considered a lowercase character. That is, I disagree with the current code.

miniBill · 2022-02-04T16:23:56Z

ǅ is neither lowercase nor uppercase, according to Unicode

https://ellie-app.com/gBmgVVFzhbRa1 this contains a table of all the 1441 codepoints that give wrong results with the current proposal

I personally think the proposal is the best compromise between accuracy and size/speed. Getting better results belongs in external packages (like elm-unicode)
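The titlecase example from this thread can be checked in JavaScript as a sketch of how the proposed check behaves. The function names mirror the Elm ones but are written here only for illustration. U+0138 is an example chosen for this sketch of a letter Unicode classifies as lowercase (`Ll`) that has no uppercase mapping; it is not taken from the linked table.

```javascript
// JavaScript rendering of the proposed two-condition check (illustrative only).
function isUpper(ch) {
  return ch === ch.toUpperCase() && ch !== ch.toLowerCase();
}

function isLower(ch) {
  return ch === ch.toLowerCase() && ch !== ch.toUpperCase();
}

// U+01C5 is a titlecase letter: it differs from both of its case mappings.
const dz = String.fromCodePoint(453);
console.log([dz, dz.toUpperCase(), dz.toLowerCase()]); // [ 'ǅ', 'Ǆ', 'ǆ' ]
console.log(isUpper(dz), isLower(dz)); // false false: neither, matching Unicode
console.log(dz !== dz.toLowerCase());  // true: a one-condition check would say "upper"

// U+0138 (ĸ) is category Ll but maps to itself under toUpperCase(),
// so the proposed check does not report it as lowercase.
const kra = "\u0138";
console.log(/\p{Ll}/u.test(kra), isLower(kra)); // true false
```

The last line shows the kind of disagreement with Unicode's categories that a mapping-based check cannot avoid without shipping the category data itself.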

rupertlssmith · 2023-05-22T14:21:24Z

This PR causes a compile error when it is used: elm-janitor/apply-patches#1

Janiczek · 2023-05-22T20:49:40Z

@rupertlssmith I can't edit this PR's code anymore, see #1138 for compilable code.


Janiczek mentioned this pull request May 22, 2023

Make Char.isLower and Char.isUpper Unicode-aware (2nd try) #1138

Open

rupertlssmith added a commit to elm-janitor/core that referenced this pull request May 23, 2023

Revert "fixes elm#970"

86fec35

This reverts commit ae7faa7. This patch was not correct and was causing a compiler error due to not explicitly import == and /= in Char.elm
